1 Title of Invention 

ARTICLE AND METHOD OF AUTOMATICALLY DETERMINING TEXT 
GENRE USING SURFACE FEATURES OF UNTAGGED TEXTS 



2 Claims 

1. A processor implemented method of identifying a text genre of an untagged 
text in machine readable form without structurally analyzing the text, the 
processor implemented method comprising the steps of:

a) generating a cue vector from the text, the cue vector representing 
occurrences in the text of a first set of nonstructural, surface cues; and 

b) determining whether the text is an instance of a first text genre using 
the cue vector and a weighting vector associated with the first text 
genre. 

3 Detailed Description of Invention 
Field of the Invention 

The present invention relates to computational linguistics. 
Background of the Invention 

The word "genre" usually functions as a literary substitute for "kind of
text." Text genre differs from the related concepts of text topic and document
genre. Text genre and text topic are not wholly independent. Distinct text
genres like newspaper stories, novels and scientific articles tend to deal
largely with different ranges of topics; however, topical commonalities within
each of these text genres are very broad and abstract. Additionally, any
extensive collection of texts relating to a single topic almost always includes
works of more than one text genre, so that the formal similarities between them
are limited to the presence of lexical items. While text genre as a concept is
independent of document genre, the two genre types grow up in close historical
association with dense functional interdependences. For example, a single
text genre may be associated with several document genres. A short story may
appear in a magazine or anthology, or a novel can be published serially in
parts, reissued as a hard cover and later as a paperback. Similarly, a document
genre like a newspaper may contain several text genres, like features, columns,
advice-to-the-lovelorn, and crossword puzzles. These text genres might not
read as they do if they did not appear in a newspaper, which licenses the use of
context dependent words like "yesterday" and "local". By virtue of their close
association, material features of document genres often signal text genre. For
example, a newspaper may use one font for the headlines of "hard news" and
another in the headlines of analysis; a periodical may signal its topical content
via paper stock; business and personal letters can be distinguished based upon
page layout; and so on. It is because digitization eliminates these material
clues as to text and document genres that it is often difficult to retrieve
relevant texts from heterogeneous digital text collections.

The boundaries between textual genres mirror the divisions of social life 
into distinct roles and activities - between public and private, generalist and
specialist, work and recreation, etc. Genres provide the context that makes 
documents interpretable, and for this reason genre, no less than content, 
shapes the user's conception of relevance. For example, a researcher seeking 
information about supercolliders or Napoleon will care as much about text 
genre as content - she will want to know not just what the source says, but 
whether that source appears in a scholarly journal or in a popular magazine. 

Until recently, work on information retrieval and text classification has
focused almost exclusively on the identification of topic, rather than on text
genre. Two reasons explain this neglect. First, the traditional print-based
document world did not perceive a need for genre classification because in this
world genres are clearly marked, either intrinsically or by institutional and
contextual features. A scientist looking in a library for an article about cold
fusion need not worry about how to restrict his search to journal articles, which
are catalogued and shelved so as to keep them distinct from popular science
magazines. Second, early information retrieval work with on-line text 
databases focused on small, relatively homogeneous databases in which text 
genre was externally controlled, like encyclopedia or newspaper databases. 
The creation of large, heterogeneous, text databases, in which the lines 
between text genres are often unmarked, highlights the importance of genre 
classification of texts. Topic-based search tools alone cannot adequately 
winnow the domain of a reader's interest when searching a large 
heterogeneous database. 

Applications of genre classification are not limited to the field of 
information retrieval. Several linguistic technologies could also profit from its 
application. Both automatic part-of-speech taggers and sense taggers could
benefit from genre classification because it is well known that the distribution 
of word senses varies enormously according to genre. 

Discussions of literary classification stretch back to Aristotle. The
literature on genre is rich with classificatory schemes and systems, some of
which might be analyzed as simple attribute systems. These discussions tend
to be vague and to focus exclusively on literary forms like the eclogue or the
novel, and, to a lesser extent, on paraliterary forms like the newspaper crime
report or the love letter. Classification discussions tend to ignore unliterary
textual types such as annual reports, Email communications and scientific
abstracts. Moreover, none of those discussions make an effort to tie the abstract
dimensions along which genres are distinguished to any formal features of the
texts.

The only linguistic research specifically concerned with quantificational
methods of genre classification of texts is that of Douglas Biber. His work
includes: Spoken and Written Textual Dimensions in English: Resolving the
Contradictory Findings, Language, 62(2):384-413, 1986; Variation Across
Speech and Writing, Cambridge University Press, 1988; The Multidimensional
Approach to Linguistic Analyses of Genre Variation: An Overview of
Methodology and Finding, Computers in the Humanities, 26(5-6):331-347,
1992; Using Register-Diversified Corpora for General Language Studies, in
Using Large Corpora, pp. 179-202 (Susan Armstrong ed.) (1994); and, with
Edward Finegan, Drift and the Evolution of English Style: A History of Three
Genres, Language, 65(1):93-124, 1989. Biber's work is descriptive, aimed at
differentiating text genres functionally according to the types of linguistic
features that each tends to exploit. He begins with a corpus that has been
hand-divided into a number of distinct genres, such as "academic prose" and
"general fiction." He then ranks these genres along several textual
"dimensions" or factors, typically three or five. Biber individuates his factors
by applying factor analysis to a set of linguistic features, most of them
syntactic or lexical. These factors include, for example, past-tense verbs, past
participial clauses and "wh-" questions. He then assigns to his factors general
meanings or functions by abstracting over the discourse functions that
linguists have assigned to the individual components of each factor; e.g.,
as an "informative vs. involved" dimension, a "narrative vs. non-narrative"
dimension, and so on. Note that these factors are not individuated according to
their usefulness in classifying individual texts according to genre. A score that
any text receives on a given factor or set of factors may not be greatly
informative as to its genre because there is considerable overlap between genres
with regard to any individual factor.

Jussi Karlgren and Douglass Cutting describe their effort to apply some
of Biber's results to automatic categorization of genre in Recognizing Text
Genres with Simple Metric Using Discriminant Analysis, in Proceedings of
Coling '94, Volume II, pp. 1071-1075, Aug. 1994. They too begin with a corpus of
hand-classified texts, the Brown corpus. The people who organized the Brown
corpus describe their classifications as generic, but the fit between the texts
and the genres a sophisticated reader would recognize is only approximate.
Karlgren and Cutting use either lexical or distributional features - the lexical
features include first-person pronoun count and present-tense verb count,
while the distributional features include long-word count and character per
word average. They do not use punctuational or character level features. Using
discriminant analysis, the authors classify the texts into various numbers of
categories. When Karlgren and Cutting used a number of functions equal to the
number of categories assigned by hand, the fit between the automatically
derived and hand-classified categories was 51.6%. They improved performance
by reducing the number of functions and reconfiguring the categories of the
corpus. Karlgren and Cutting observe that it is not clear that such methods
will be useful for information retrieval purposes, stating: "The problem with
using automatically derived categories is that even if they are in a sense real,
meaning that they are supported by the data, they may be difficult to explain
for the unenthusiastic layman if the aim is to use the technique in retrieval
tools." Additionally, it is not clear to what extent the idiosyncratic "genres" of
the Brown corpus coincide with the categories that users find relevant for
information retrieval tasks.

Geoffrey Nunberg and Patrizia Violi suggest that genre recognition will
be important for information retrieval and natural language processing tasks
in Text, Form and Genre, in Proceedings of OED '92, pp. 118-122, October 1992.
These authors propose that text genre can be treated in terms of attributes,
rather than classes; however, they offer no concrete proposal as to how
identification can be accomplished.
Summary of the Invention 

The method of the present invention for automatically identifying the
genre of a machine readable, untagged text provides these and other
advantages. Briefly described, the processor implemented method begins by
generating a cue vector from the text, which represents occurrences in the text
of a first set of nonstructural, surface cues, which are easily computable.
Afterward, the processor determines whether the text is an instance of a first
text genre using the cue vector and a weighting vector associated with the first
text genre.

Detailed Description of the Preferred Embodiments 

Figure 1 illustrates in block diagram form computer system 10 in which
the present method is implemented by executing instructions 100. The present
method alters the operation of computer system 10, allowing it to
automatically determine the text genre of an untagged text presented to it in
machine readable form. Instructions 100 enable text genre classification to
occur without structural analysis of the text, word stemming or part of speech
tagging. Instructions 100 rely upon new surface-level cues, or features, which
can be computed more quickly than structurally based features. Briefly
described, according to instructions 100, computer system 10 analyzes the text
to determine the number of occurrences of each surface cue within the text and
generates a cue vector. Computer system 10 then determines whether the text
is an instance of a particular text genre and/or facet using the cue vector and a
weighting vector associated with the particular text genre and/or facet.
Instructions 100 will be described in detail with respect to Figure 4. Computer
system 10 determines the appropriate weighting vector for each text genre
and/or facet using training instructions 50, which will be described in detail
with respect to Figure 3.
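
The two steps just summarized can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the particular cues, the weight values, the bias, the decision threshold, and the use of a logistic function to turn the weighted sum into a score are all hypothetical choices and are not taken from instructions 50 or 100.

```python
import math
import re

# Hypothetical cue set: each entry maps raw text to one number.
# The cues loosely follow the illustrative lists given later in this section.
CUES = [
    lambda t: math.log(t.count(",") + 1),      # log(comma count + 1)
    lambda t: math.log(t.count("?") + 1),      # log(question mark count + 1)
    lambda t: math.log(t.count(";") + 1),      # log(semicolon count + 1)
    lambda t: len(re.findall(r"\bMrs?\.", t)),  # tokens "Mr." and "Mrs."
]


def cue_vector(text):
    """Step 1: compute one value per surface cue, with no parsing or tagging."""
    return [cue(text) for cue in CUES]


def is_instance(cues, weights, bias=0.0, threshold=0.5):
    """Step 2: combine the cue vector with a weighting vector.

    Assumption: the weighted sum is squashed with a logistic function and
    compared with a threshold; other combination rules are possible.
    """
    z = bias + sum(c * w for c, w in zip(cues, weights))
    score = 1.0 / (1.0 + math.exp(-z))
    return score >= threshold, score
```

In this sketch a separate weighting vector would be kept for each text genre or facet, and the same cue vector would be scored against each of them in turn.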

A. A Computer System for Automatically Determining Text Genre 

Prior to a more detailed discussion of instructions 50 and 100, consider
computer system 10, which executes those instructions. As illustrated in Figure 1,
computer system 10 includes monitor 12 for visually displaying information to
a computer user. Computer system 10 also outputs information to the
computer user via printer 13. Computer system 10 provides the computer user
multiple avenues to input data. Keyboard 14 allows the computer user to input
data to computer system 10 by typing. By moving mouse 16 the computer user
is able to move a pointer displayed on monitor 12. The computer user may also
input information to computer system 10 by writing on electronic tablet 18
with a stylus 20 or pen. Alternately, the computer user can input data stored
on a magnetic medium, such as a floppy disk, by inserting the disk into floppy
disk drive 22. Scanner 24 allows the computer user to generate machine
readable versions, e.g. ASCII, of hard copy documents.

Processor 11 controls and coordinates the operations of computer system
10 to execute the commands of the computer user. Processor 11 determines and
takes the appropriate action in response to each user command by executing
instructions, which, like instructions 50 and 100, are stored electronically in
memory, either memory 28 or on a floppy disk within disk drive 22. Typically,
operating instructions for processor 11 are stored in solid state memory,
allowing frequent and rapid access to the instructions. Semiconductor logic
devices that can be used to realize memory include read only memories (ROM),
random access memories (RAM), dynamic random access memories (DRAM),
programmable read only memories (PROM), erasable programmable read only
memories (EPROM), and electrically erasable programmable read only
memories (EEPROM), such as flash memories.
B. Text Genres, Facets and Cues

According to instructions 50 and 100, computer system 10 determines
the text genre of a tokenized, machine readable text that has not been
structurally analyzed, stemmed, parsed, or tagged for sense or parts of speech.
As used herein, a "text genre" is any widely recognized class of texts defined by
some common communicative purpose or other functional traits, provided that
the function is connected to some formal cues or commonalities that are not the
direct consequences of the immediate topic that the texts address. Wide
recognition of a class of texts enables the public to interpret the texts of the
class using a characteristic set of principles of interpretation. As used herein,
text genre applies only to sentential genres; that is, applies only to genres that
communicate primarily via sentences and sentence-like strings that make use
of the full repertory of text-category indicators like punctuation marks,
paragraphs, and the like. Thus, according to the present invention, airline
schedules, stock tables and comic strips are not recognized as text genres. Nor
does the present invention recognize genres of spoken discourse as text genres.
Preferably, the class defined by a text genre should be extensible. Thus,
according to the present invention, the class of novels written by Jane Austen is
not a preferred text genre because the class is not extensible.

The methods of instructions 50 and 100 treat text genres as a bundle of
facets, each of which is associated with a characteristic set of computable
linguistic properties, called cues or features, which are observable from the
formal, surface level features of texts. Using these cues, each facet
distinguishes a class of texts that answer to certain practical interests. Facets
tend to identify text genre indirectly because one facet can be relevant to
multiple genres. Because any text genre can be defined as a particular cluster
of facets, the present method allows identification of text genres and
supergenres with the same accuracy as other approaches, but with the
advantage of easily allowing the addition of new, previously unencountered
text genres.

Rather than attempting to further define the concept of facets, consider
a number of illustrative examples. The audience facet distinguishes between
texts that have been broadcast and those whose distribution was directed to a
more limited audience. The length facet distinguishes between short and long
texts. Distinctions between texts that were authored by organizations or
anonymously and those authored by individuals are represented by the author
facet. Listed below are other facets and their values, when those values are not
obvious. Note that facets need not be binary valued.

Facet Name                Possible Values

1. Date                   Dated/Undated

2. Narrative              Yes/No

3. Suasive (Argumentative)/Descriptive (Informative)

4. Fiction/Nonfiction

5. Legal                  Yes/No

6. Science & Technical    Yes/No

7. Brow                   Popular    Yes/No
                          Middle     Yes/No
                          High       Yes/No

Other facets can be defined and added to those listed above consistent
with the present invention. Not all facets need be used to define a text genre; 
indeed, a text genre could be defined by a single facet. Listed below are but a 
few examples of conventionally recognized text genres that can be defined 
using the facets and values described. 

1. Press Reports

   a. Audience               Broadcast
   b. Date                   Dated
   c. Suasive                Descriptive
   d. Narrative              Yes
   e. Fiction                No
   f. Brow                   Popular
   g. Author                 Unsigned
   h. Science & Technical    No
   i. Legal                  No
2. Editorial Opinions

   a. Audience               Broadcast
   b. Date                   Dated
   c. Suasive                Yes
   d. Narrative              Yes
   e. Fiction                No
   f. Brow                   Popular
   g. Authorship             Signed
   h. Science & Technical    No
   i. Legal                  No



3. Market Analysis

   a. Audience               Broadcast
   b. Date                   Dated
   c. Suasive                Descriptive
   d. Narrative              No
   e. Fiction                No
   f. Brow                   High
   g. Authorship             Organizational
   h. Science & Technical    Yes
   i. Legal                  No


4. Email

   a. Audience               Directed
   b. Date                   Dated
   c. Fiction                No
   d. Brow                   Popular
   e. Authorship             Signed
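
The idea that a text genre is a bundle of facet values can be made concrete with a small lookup, sketched here in Python. The dictionary layout, the function name `matching_genres`, and the decision to use the single key "Authorship" for the author facet in every genre are assumptions made for illustration; the facet values themselves are copied from the four example genres above.

```python
# Each genre is a set of required facet values; facets not listed are "don't care".
GENRES = {
    "Press Reports": {
        "Audience": "Broadcast", "Date": "Dated", "Suasive": "Descriptive",
        "Narrative": "Yes", "Fiction": "No", "Brow": "Popular",
        "Authorship": "Unsigned", "Science & Technical": "No", "Legal": "No",
    },
    "Editorial Opinions": {
        "Audience": "Broadcast", "Date": "Dated", "Suasive": "Yes",
        "Narrative": "Yes", "Fiction": "No", "Brow": "Popular",
        "Authorship": "Signed", "Science & Technical": "No", "Legal": "No",
    },
    "Market Analysis": {
        "Audience": "Broadcast", "Date": "Dated", "Suasive": "Descriptive",
        "Narrative": "No", "Fiction": "No", "Brow": "High",
        "Authorship": "Organizational", "Science & Technical": "Yes",
        "Legal": "No",
    },
    "Email": {
        "Audience": "Directed", "Date": "Dated", "Fiction": "No",
        "Brow": "Popular", "Authorship": "Signed",
    },
}


def matching_genres(facet_values, genres=GENRES):
    """Return the genres whose every required facet value the text satisfies."""
    return [
        name
        for name, required in genres.items()
        if all(facet_values.get(facet) == value
               for facet, value in required.items())
    ]
```

Under this layout, adding a previously unencountered genre amounts to adding one more dictionary entry, provided the facet values of a text have already been determined.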



Just as text genres decompose into a group of facets, so do facets 
decompose into surface level cues according to the present methods. The 
surface level cues of the present invention differ from prior features because 
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they can be computed using tokenized ASCII text without doing any structural 
analysis, such as word stemming, parsing or sense or part of speech tagging. 
For the most part, it is the frequency of occurrence of these surface level cues 
within a text that is relevant to the present methods. Several types of surface 
level or formal cues can be defined, including, but not limited to: 
numerical/statistical, punctuational, constructional, formulae, lexical and 
deviation. Formulae type cues are collocations or fixed expressions that are 
conventionally associated with a particular text genre. For example, fairy tales 
begin with "Once upon a time" and Marian hymns begin with "Hail Mary." 
Other formulae announce legal documents, licensing agreements and the like. 
Lexical type cues are directed to the frequency of certain lexical items that can 
signal a text genre. For example, the use of formal terms of address like "Mr., 
Mrs. and Ms," are associated with articles in the New York Times; and the use 
of words like "yesterday" and •local" frequently occur in newspaper reports. 
Additionally, the use of a phrase like "it's pretty much a snap" indicate that a 
text is not part of an encyclopedia article, for example. The use of some lexical 
items is warranted by the topical and rhetorical commonalties of some text 
genres. While constructional features are known in the prior art, computation 
of most of them requires tagged or fully parsed text. Two new surface level 
constructional cues are defined according to the present invention which are 
string recognizable. Punctuational type cues are counts of punctuational
features within a text. Cues of this type have not been used previously; however,
they can serve as a useful indicator of text genre because they are at once
significant and very frequent. For example, a high question mark count may
indicate that a text attempts to persuade its audience. In contrast to most 
other cue types, which measure the frequency of surface level features within a 
particular text, deviation type cues relate to deviations in unit size. For 
example, deviation cues can be used to track variations in sentence and 
paragraph length, features that may vary according to text genre. Cue types
have been described merely to suggest the kinds of surface level features that
can be measured to signal text features; characterization of cue type is not
important to the present invention. The number of cues that can be defined is
theoretically unlimited. Just a few of the possible cues are listed below for 
illustrative purposes. 

A. Punctuational Cues 

1. Log (comma count + 1)

2. Mean (commas/sentences)/article

3. Mean (dashes/sentences)/article

4. Log (question mark count + 1)

5. Mean (questions/sentences)/article

6. Log (dash count + 1)

7. Log (semicolon count + 1)

B. String Recognizable Constructional Cues 

1. Sentences starting w/ "and", "but" and "so" per article

2. Sentences starting w/adverb + comma/article 

C. Formulae Cues 

1 . "Once upon a time..." 

D. Lexical Cues (Token counts only are taken unless otherwise 
indicated) 

1. Abbreviations for "Mr., Mrs." etc.

2. Acronyms

3. Modal auxiliaries

4. Forms of the verb "be"

5. Calendar - days of the week, months

6, 7. Capital - non-sentence initial words that are capitalized
Type and token counts

8. Number of characters

9, 10. Contractions
Type and token counts

11, 12. Words that end in "ed"
Type and token counts

13. Mathematical formula

14. Forms of the verb "have"

15, 16. Hyphenated words
Type and token counts

17, 18. Polysyllabic words
Type and token counts

19. The word "it"

20, 21. Latinate prefixes and suffixes
Type and token counts

22, 23. Words more than 6 letters
Type and token counts

24, 25. Words more than 10 letters
Type and token counts

26, 27. Three + word phrases
Type and token counts

28, 29. Polysyllabic words ending in "ly"
Type and token counts

30. Overt negatives

31, 32. Words containing at least one digit
Type and token counts

33. Left parentheses

34, 35. Prepositions
Type and token counts

36. First person singular pronouns

37. First person plural pronouns

38. Pairs of quotation marks

39. Roman numerals

40. Instances of "that"

41. Instances of "which"

42. Second person plural pronouns



E. Deviation Cues

1. standard deviation of sentence length in words 

2. standard deviation of word length in characters

3. standard deviation of length of text segments between
punctuation marks in words

4. Mean (characters/words) per article
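
By way of illustration only, a few of the punctuational, constructional and deviation cues listed above might be computed as in the following Python sketch. The function names, the naive sentence splitter, the choice of natural logarithm and the use of the population standard deviation are all assumptions made for the example; the specification does not fix any of them.

```python
import math
import re
import statistics


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def first_word(sentence):
    """Return the first alphabetic word of a sentence, lower-cased."""
    match = re.match(r"[^A-Za-z]*([A-Za-z]+)", sentence)
    return match.group(1).lower() if match else ""


def surface_cues(text):
    """Compute a handful of surface level cues from raw text, with no parsing or tagging."""
    sentences = split_sentences(text)
    n_sentences = max(len(sentences), 1)
    words = re.findall(r"[A-Za-z0-9'-]+", text)
    sentence_lengths = [len(s.split()) for s in sentences]
    return {
        # Punctuational cues (natural log is an assumption).
        "log_comma_count": math.log(text.count(",") + 1),
        "log_question_count": math.log(text.count("?") + 1),
        "log_semicolon_count": math.log(text.count(";") + 1),
        "commas_per_sentence": text.count(",") / n_sentences,
        # String recognizable constructional cue.
        "conj_initial_sentences": sum(
            1 for s in sentences if first_word(s) in ("and", "but", "so")
        ),
        # Deviation cues (population standard deviation is an assumption).
        "sd_sentence_length": statistics.pstdev(sentence_lengths) if sentence_lengths else 0.0,
        "mean_chars_per_word": sum(len(w) for w in words) / len(words) if words else 0.0,
    }
```

Every value is obtained by string matching and counting alone, which is the property that distinguishes these cues from features requiring tagged or parsed text.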

Table 1 of Figure 2 presents the results of a preliminary trial with a corpus
of approximately four hundred texts and illustrates how some surface level cues
can vary according to facet/text genre. (This trial treated some text genres as a
single facet, rather than decomposing the text genres as described above. Both
approaches are consistent with the present invention. As stated previously, a
text genre may be defined by a single facet.) For example, within this corpus
press reports included only 1.2 semicolons per article, while legal documents
included 4.78. Similarly, the number of dashes per text differed among press
reports, editorial opinions and fiction.

What weight should be given to different cue values? Or, stated another
way, how strongly does a cue value, or set of cue values, correlate with a
particular facet or text genre? In contrast to the decomposition of text genres
into facet values, which is a matter of human judgment, answering this question
is not. Determining the weight accorded to each cue according to facet requires
training, which is described below with respect to Figure 3.

C. Training to Determine Cue Weights 

Figure 3 illustrates in flow diagram form training method 30 for
determining cue weights for each cue. Training method 30 is not entirely
automatic; steps 32, 34 and 36 are manually executed while those of
instructions 50 are processor implemented. Instructions 50 may be stored in
solid state memory or on a floppy disk placed within a floppy disk drive and may
be realized in any computer language, including LISP and C++.

Training method 30 begins with the selection of a set of cues and another 
set of facets, which can be used to define a set of widely recognized text genres. 
Preferably, about 50 to 55 surface level cues are selected during step 32, 
although a lesser or greater number can be used consistent with the present 
invention. Selection of a number of lexical and punctuational type surface level
cues is also preferred. The user may incorporate all of the surface level cues 
into each facet defined, although this is not necessary. While any number of 
facets can be defined and selected during step 32, the user must define some 
number of them. In contrast, the user need not define text genres at this point 
because facets by themselves are useful in a number of applications, as will be 
discussed below. Afterward, during step 34 the user selects a heterogeneous 
corpus of texts. Preferably the selected corpus includes about 20 instances of 
each of the selected text genres or facets, if text genres have not been defined. 
If not already in digital or machine readable form, typically ASCII, then the 
corpus must be converted and tokenized before proceeding to instructions 50. 
Having selected facets, surface level cues and a heterogeneous corpus, during 
step 36 the user associates machine readable facet values with each of the texts 
of the corpus. Afterward, the user turns the remaining training tasks over to 
computer system 10.

Instructions 50 begin with step 52, during which processor 11 generates 
a cue vector, X, for each text of the corpus. The cue vector is a
multidimensional vector having a value for each of the selected cues. Processor 11
determines the value for each cue based upon the relevant surface level 
features observed within a particular text. Methods of determining cue values 
given definitions of the selected cues will be obvious to those of ordinary skill 
and therefore will not be described in detail herein. Because these methods do 
not require structural analysis or tagging of the texts, processor 11 expends
relatively little computational effort in determining cue values during step 52. 
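
As one possible reading of step 52, the sketch below assembles a fixed-order cue vector from lexical cues that take both token and type counts. The cue predicates, the tokenizer and the names `CUES`, `tokenize` and `cue_vector` are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical cue definitions: each maps a cue name to a test on a single token.
CUES = {
    "contraction": lambda t: "'" in t,
    "hyphenated": lambda t: "-" in t,
    "long_word": lambda t: len(t) > 6,          # "words more than 6 letters"
    "has_digit": lambda t: any(c.isdigit() for c in t),
}


def tokenize(text):
    """Simple tokenizer (an assumption): words with internal apostrophes or hyphens."""
    return re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", text)


def cue_vector(text):
    """Return [token count, type count] for each cue, in sorted cue-name order."""
    tokens = tokenize(text)
    vector = []
    for name in sorted(CUES):
        matches = [t.lower() for t in tokens if CUES[name](t)]
        vector.append(len(matches))        # token count: every occurrence
        vector.append(len(set(matches)))   # type count: distinct forms only
    return vector
```

Keeping the cue order fixed matters because the same positions must line up with the weighting vector used later.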

Processor 11 determines the weighting that should be given to each cue
according to facet value during step 54. In other words, during step 54
processor 11 generates a weighting vector, $\beta$, for each facet. Like the cue
vector, X, the weighting vector, $\beta$, is a multidimensional vector having a value
for each of the selected cues. A number of mathematical approaches can be
used to generate weighting vectors from the cue vectors for the corpus,
including logistic regression. Using logistic regression, processor 11 divides
the cue vectors generated during step 52 into sets of identical cue vectors. Next,
for each binary valued facet, processor 11 solves a log odds function for each set
of identical cue vectors. The log odds function, $g(\phi)$, is expressed as:

$$g(\phi) = \log\left(\frac{\phi}{1-\phi}\right) = X\beta$$

where: $\phi$ is the proportion of vectors in the set for which the facet
value is true;

$1-\phi$ is the proportion of vectors in the set for which the facet
value is false.

The processor 11 is able to determine the values of $\phi$ and $1-\phi$ because earlier
tagging of facet values indicates the number of texts having each facet value
within each set of texts having identical cue vectors. Thus, processor 11 can
determine the values of weighting vector $\beta$ for each binary valued facet by
solving the system of simultaneous equations defined by all the sets of
identical cue vectors, the known values of $\phi$ and $1-\phi$, and the cue vector values.
Logistic regression is well known and will not be described in greater detail
here. For a more detailed discussion of logistic regression, see Chapter 4 of
McCullagh, P. and Nelder, J. A., Generalized Linear Models, 2d Ed., 1989
(Chapman and Hall pub.), incorporated herein by reference.
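
For readers who prefer code, the following Python sketch fits a weighting vector for one binary valued facet. It is not the grouping-and-solving procedure described above: it substitutes plain gradient ascent on the logistic log-likelihood, which targets the same model. The intercept term, learning rate, epoch count and function names are assumptions.

```python
import math


def sigmoid(z):
    """Numerically safe logistic function: maps log odds to a proportion."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_facet_weights(cue_vectors, labels, lr=0.1, epochs=2000):
    """Fit weights for one binary facet; beta[0] is an assumed intercept term.

    cue_vectors: list of equal-length lists of cue values, one per text.
    labels: 1 if the facet value is true for the text, otherwise 0.
    """
    n_cues = len(cue_vectors[0])
    beta = [0.0] * (n_cues + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_cues + 1)
        for x, y in zip(cue_vectors, labels):
            row = [1.0] + list(x)
            phi = sigmoid(sum(b * v for b, v in zip(beta, row)))
            for j, v in enumerate(row):
                grad[j] += (y - phi) * v   # gradient of the log-likelihood
        beta = [b + lr * g / len(labels) for b, g in zip(beta, grad)]
    return beta
```

In practice cue values would likely be scaled before fitting, and a library routine would normally replace the hand-written loop.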

Processor 11 can use the method just described to generate weighting 
vectors for facets that are not binary valued, like the Brow facet, by treating 
each value of the facet as a binary valued facet, as will be obvious to those of 
ordinary skill. In other words, a weighting vector is generated for each value of
a non-binary valued facet. 

Using logistic regression with as large a number of cues as preferred,
50-55, may lead to overfitting. Further, logistic regression does not model
variable interactions. To allow modeling of variable interactions and avoid 
overfitting, neural networks can be used during step 54 to generate the 
weighting vectors and may improve performance. However, either approach 
may be used during step 54 consistent with the present invention. 

To enable future automatic identification of text genre, processor 11 
stores in memory the weighting vectors for each of the selected facets. That
done, training is complete. 

D. Automatically Identifying Text Genre and Facets 

Figure 4 illustrates in flow diagram form instructions 100. By executing
instructions 100, processor 11 automatically identifies the text genre of a
machine readable, untagged text using a set of surface level cues, a set of
facets and weighting vectors. Briefly described, according to instructions 100,
processor 11 first generates a cue vector for the tokenized machine readable 
text to be classified. Subsequently, processor 11 determines the relevancy of 
each facet to the text using the cue vector and a weighting vector associated 
with the facet. After determining the relevancy of each facet to the text, 
processor 11 identifies the genre or genres of the text. Instructions 100 may be 
stored in solid state memory or on a floppy disk placed within a floppy disk drive
and may be realized in any computer language, including LISP and C++. 

In response to a user request to identify the genre of a selected tokenized, 
machine readable text, processor 11 advances to step 102. During that step, 
processor 11 generates for the text a cue vector, X, which represents the 
observed values within the selected text for each of the previously defined 
surface level cues. As discussed previously, methods of determining cue values 
given cue definitions will be obvious to those of ordinary skill and need not be 
discussed in detail here. Processor 11 then advances to step 104 to begin the
process of identifying the facets relevant to the selected text. 

According to instructions 100, identification of relevant facets begins 
with the binary valued facets; however, consistent with the present invention 
identification may also begin with the non-binary valued facets. Evaluation of 
the binary valued facets begins with processor 11 selecting one during step 
104. 

Processor 11 then retrieves from memory the weighting vector, $\beta$,
associated with the selected facet and combines it with the cue vector, X, 
generated during step 102. Processor 11 may use a number of mathematical 
approaches to combine these two vectors to produce an indicator of the 
relevance of the selected facet to the text being classified, including logistic 
regression and the log odds function. In contrast to its use during training, 
during step 106 processor 11 solves the log odds function to find $\phi$, which now
represents the relevance of the selected facet to the text. Processor 11 regards 
a facet as relevant to a text if solution of the log odds function produces a value 
greater than 0, although other values can be chosen as a cut-off for relevancy
consistent with the present invention. 
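
The relevance test of step 106 might be sketched as follows. The sketch assumes that the first element of the weighting vector is an intercept and that the cut-off applies to the log odds, so that a cut-off of 0 corresponds to $\phi$ = 0.5; both are interpretations made for the example, and the cut-off is left as a parameter.

```python
import math


def facet_relevance(cue_vector, beta):
    """Return (log_odds, phi) for one facet; beta[0] is an assumed intercept."""
    log_odds = beta[0] + sum(b * x for b, x in zip(beta[1:], cue_vector))
    # Invert g(phi) = log(phi / (1 - phi)) without overflowing for large |log_odds|.
    if log_odds >= 0:
        phi = 1.0 / (1.0 + math.exp(-log_odds))
    else:
        e = math.exp(log_odds)
        phi = e / (1.0 + e)
    return log_odds, phi


def is_relevant(cue_vector, beta, cutoff=0.0):
    """A facet is treated as relevant when its log odds exceed the cut-off."""
    log_odds, _ = facet_relevance(cue_vector, beta)
    return log_odds > cutoff
```

For example, `is_relevant([2.0, 0.5], [-1.0, 0.5, 2.0])` evaluates a log odds of 1.0 and so reports the facet as relevant under the default cut-off.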

Having determined the relevancy of one binary valued facet, processor 
11 advances to step 108 to ascertain whether other binary-valued facets
require evaluation. If so, processor 11 branches back up to step 104 and
continues evaluating the relevancy of facets, one at a time, by executing the
loop of steps 104, 106 and 108 until every binary-valued facet has been 
considered. When that occurs, processor 11 branches from step 108 to step 110 
to begin the process of determining the relevancy of the non-binary valued 
facets. 

Processor 11 also executes a loop to determine the relevance of the 
non-binary valued facets. Treatment of the non-binary valued facets differs 
from that of binary valued facets in that the relevance of each facet value must 
be evaluated separately. Thus, after generating a value of the log odds function 
for each value of the selected facet by repeatedly executing step 114, processor 
11 must decide which facet value is most relevant during step 118. Processor 
11 regards the highest scoring facet value as the most relevant. After 
determining the appropriate facet value for each of the non-binary valued 
facets, processor 11 advances to step 122 from step 120. 
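
The loop over the values of a non-binary valued facet can be sketched as scoring each value with its own weighting vector and keeping the highest score. The function name, the intercept-first layout of each weighting vector, and the facet value "middle" used in the usage note are placeholders for illustration.

```python
def most_relevant_value(cue_vector, value_weights):
    """Score every value of one non-binary facet and return the best one.

    value_weights maps each facet value to its weighting vector, with an
    assumed intercept in position 0.
    """
    scores = {
        value: beta[0] + sum(b * x for b, x in zip(beta[1:], cue_vector))
        for value, beta in value_weights.items()
    }
    best = max(scores, key=scores.get)   # highest log odds wins
    return best, scores
```

With weights `{"popular": [0.0, 1.0, -1.0], "middle": [0.5, 0.0, 0.0], "high": [-1.0, -0.5, 2.0]}` and cue vector `[1.0, 2.0]`, the scores are -1.0, 0.5 and 2.5, so "high" is selected for the Brow facet.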

During step 122 processor 11 identifies which text genres the selected 
text represents using the facets determined to be relevant and the text genre 
definitions in terms of facet values. Methods of doing so are obvious to those of 
ordinary skill and need not be described in detail herein. Afterward, processor 
11 associates with the selected text, the text genres and facets determined to 
be relevant to the selected text. While preferred, determination of text genres
during step 122 is optional: as noted previously, text genres need not be
defined, because facet classifications are useful by themselves.
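
One simple way step 122 could be realized is to treat each text genre definition as a set of required facet values and report every genre whose requirements are all met. In the sketch below the Email definition follows the facet values listed earlier; the "legal_notice" entry, the dictionary layout and the function name are hypothetical.

```python
# Genre definitions expressed as required facet values.
GENRE_DEFINITIONS = {
    "email": {
        "audience": "directed",
        "date": "dated",
        "fiction": "no",
        "brow": "popular",
        "authorship": "signed",
    },
    # Hypothetical single-facet genre, included only to show a second entry.
    "legal_notice": {"legal": "yes"},
}


def matching_genres(facet_values, definitions=GENRE_DEFINITIONS):
    """Return the genres whose every required facet value matches the text's."""
    return [
        genre
        for genre, required in definitions.items()
        if all(facet_values.get(facet) == value for facet, value in required.items())
    ]
```

A text may match several genres or none; in the latter case the facet values themselves remain available as the classification.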

E. Applications for Text Genre and Facet Classification

The fields of natural language and information retrieval both present a 
number of applications for automatic classification of text genre and facets. 
Within natural language, automatic text classification will be useful with 
taggers and translation. Within the information retrieval field, text genre 
classification will be useful as a search filter and parameter, in revising 
document format and enhancing automatic summarization. 
Present sense taggers and part of speech taggers both use raw statistics
about the frequency of items within a text. The performance of these taggers 
can be improved by automatically classifying texts according to their text 
genres and computing probabilities relevant to the taggers according to text 
genre. For example, the probability that "sore" will have the sense of "angry" or 
that "cool" will have the sense of "first-rate" is much greater in a newspaper
movie review of a short story than in a critical biography. 

Both language translation systems and language generation systems 
distinguish between synonym sets. The conditions indicating which synonym of 
a set to select are complex and must be accommodated. Language translation
systems must both recognize the sense of a word in the original language and
then identify an appropriate synonym in the target language. These difficulties
cannot be resolved simply by labeling the items in each language and
translating systematically between them; e.g., by categorically substituting
the same "slang" English word for its "slang" equivalent in French. In one
context the French sentence "Il cherche un boulot" might be translated by "He's
looking for a gig," in another context by "He's looking for a job". The sentence
"Il (re)cherche un travail" might be either "He's looking for a job" or "He's
seeking employment," and so on. Making the appropriate choice depends on an
analysis of the genre of the text from which a source item derives. Automatic 
text genre classification can improve the performance of both language 
translation systems and language generation systems. It can do so because it 
allows recognition of different text genres and of different registers of a 
language, and, thus, distinctions between members of many synonym sets. 
Such synonym sets include: "dismiss/fire/can," "rather/pretty," "want/wish,"
"buy it/die/decease," "wheels/car/automobile" and "gig/job/position".

Most information retrieval systems have been developed using
homogeneous databases and they tend to perform poorly on heterogeneous
databases. Automatic text genre classification can improve the performance of
information retrieval systems with heterogeneous databases by acting as a
filter on the output of topic-based searches or as an independent search 
parameter. For example, a searcher might search for newspaper editorials on a 
supercollider, but exclude newspaper articles, or search for articles on LANS in 
general magazines but not technical journals. Analogously, a searcher might 
start with a particular text and ask the search system to retrieve other texts 
similar to it as to genre, as well as topic. Information retrieval systems could 
use genre classification as a way of ranking or clustering the results of a topic 
based search. 
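
The filtering and ranking uses just described can be illustrated with a short Python sketch. The record layout (a dict with "title", "genres" and "facets" keys) and both function names are assumptions for the example and do not correspond to any particular retrieval system.

```python
def filter_by_genre(results, include=(), exclude=()):
    """Keep topic-search hits that carry a wanted genre and no unwanted genre."""
    kept = []
    for doc in results:
        genres = set(doc["genres"])
        if include and not genres & set(include):
            continue   # none of the requested genres apply
        if genres & set(exclude):
            continue   # an excluded genre applies
        kept.append(doc)
    return kept


def rank_by_facet_overlap(results, reference_facets):
    """Order hits by how many facet values they share with a reference text."""
    def overlap(doc):
        return sum(
            1 for facet, value in reference_facets.items()
            if doc["facets"].get(facet) == value
        )
    # sorted() is stable, so ties keep their original topic-search order.
    return sorted(results, key=overlap, reverse=True)
```

The first function corresponds to using genre as a filter on topic-based output; the second corresponds to retrieving texts similar in genre to a starting text.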

Automatic genre classification will also have information retrieval 
applications relating to document format. A great many document databases 
now include information about the appearance of the electronic texts they 
contain. For example, mark-up languages are frequently used to specify the 
format of digital texts on the Internet. OCR of hardcopy documents also 
produces electronic documents including a great deal of format information. 
However, the meaning of format features can vary within a heterogeneous 
database according to genre. As an example, consider the alternating use of 
boldface and normal type within a text. Within a magazine article this format 
feature likely indicates an interview; within an encyclopedia this same feature 
denotes headings and subsequent text; within a manual this feature may be 
used to indicate information of greater or lesser importance; or still yet, within 
the magazine Wired this format feature is used to distinguish different articles. 
Using automatic text genre classification to determine the meaning of format 
features would be useful in a number of applications. Doing so enables users to 
constrain their searches to major fields or document domains, like headings, 
summaries, and titles. Analogously, determining the meaning of format 
features enables discriminating between document domains of greater and 
lesser importance during automatic document summarization, topic clustering 
and other information retrieval tasks. Determining the meaning of format 
features also enables the representation of digital documents in a new format.
In a number of situations preservation of original format is impossible or 
undesirable. For example, a uniform format may be desired when generating a 
new document by combining several existing texts with different format styles. 

In a similar vein, automatic genre classification is useful when 
determining how to format an unformatted ASCII text. 

Automatic classification of text genre has a number of applications to 
automatic document summarization. First, some automatic summarizers use
the relative position of a sentence within a paragraph as a feature in 
determining whether the sentence should be extracted. However, the 
significance of a particular sentence position varies according to genre. 
Sentences near the beginning of newspaper articles are more likely to be 
significant than those near the end. One assumes this is not the case for other 
genres like legal decisions and magazine stories. These correlations could be 
determined empirically using automatic genre classification. Second, genre 
classification allows tailoring of summaries according to the genre of the 
summarised text, which is desirable because what readers consider an 
adequate summary varies according to genre. Automatic summarizers 
frequently have difficulty determining where a text begins because of prefatory 
material, leading to a third application for automatic genre classification. 
Frequently, prefatory material associated with texts varies according to text 
genre. 

4 Brief Description of the Drawings 

Figure 1 illustrates a computer system for automatically determining 
the text genre of machine readable texts. 

Figure 2 illustrates Table 1, a table of trial observations of surface cue 
values according to facet value. 

Figure 3 illustrates in flow diagram form instructions for training to 
generate weighting vector values from a training corpus.

Figure 4 illustrates in flow diagram form instructions for determining 
the relevance of text genres and facets to a machine readable text. 

Figure 5 illustrates in flow diagram form instructions for presenting
search results to the user in an order based upon text genre type or facet 
values. 

Figure 6 illustrates in flow diagram form instructions for presenting 
search results to the computer user. 

1 Abstract

A processor implemented method of identifying the genre of a machine
readable, untagged text. The processor implemented method begins by
generating a cue vector from the text which represents occurrences in the text of
a first set of nonstructural, surface cues, which are easily computable.
Afterward, the processor determines whether the text is an instance of a first text
genre using the cue vector and a weighting vector associated with the first text 
genre. 



2 Representative Drawing 

Figure 3 



